Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library("tidyverse")
library("nycflights13")
package 'nycflights13' was built under R version 3.5.2
Enter your code chunks for Section 10.2 here.
Describe what each code chunk does.
# 10.2 Creating Tibbles
# as_tibble() coerces an existing data frame, here iris, to a tibble
as_tibble(iris)
# New tibbles can be created using tibble()
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
# Tibble style data frames can use non-syntactic names such as numbers for names or unusual symbols that are not letters or numbers
tb <- tibble(`:)` = "smile", ` ` = "space",`2000` = "number")
tb
Enter your code chunks for Section 10.3 here.
Describe what each code chunk does.
# 10.3 Printing
# Tibbles only show the first 10 rows of data and however many columns can fit on screen. Each column of data reports its type.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
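# print() with n and width controls how many rows and columns are displayed; width = Inf shows all columns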
nycflights13::flights %>%
print(n = 10, width = Inf)
# a view of the entire data set
nycflights13::flights %>%
View()
# 10.3.2 Subsetting
# A single column can be extracted by name with $ or [[ ]], or by position with [[ ]]
df <- tibble(
x = runif(5),
y = rnorm(5)
)
df$x
[1] 0.1332958 0.1191248 0.6962651 0.2879061 0.7305983
df[["x"]]
[1] 0.1332958 0.1191248 0.6962651 0.2879061 0.7305983
df[[1]]
[1] 0.1332958 0.1191248 0.6962651 0.2879061 0.7305983
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble. Hint: What does as_tibble() do? What does class() do? What does str() do?
#10.5
#Printing mtcars as a regular data frame, and then printing it as a tibble data frame
print(mtcars)
as_tibble(mtcars)
class(mtcars)
[1] "data.frame"
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
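One way to answer question 1 in code is sketched below. It assumes the tibble package is installed (it comes with the tidyverse) and checks the class directly instead of relying only on how the object prints. The name mtcars_tbl is made up for this example.

```r
library(tibble)

# mtcars is a plain data frame; as_tibble() returns a tibble copy of it
mtcars_tbl <- as_tibble(mtcars)

# 1. is_tibble() answers the question directly
is_tibble(mtcars)      # FALSE
is_tibble(mtcars_tbl)  # TRUE

# 2. class() lists "tbl_df" and "tbl" in front of "data.frame" for a tibble
class(mtcars_tbl)

# 3. Printing also differs: a tibble prints its dimensions, column types,
#    and only the first 10 rows, while print(mtcars) shows every row
mtcars_tbl
```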
2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] a
Levels: a
df[, "xyz"]
[1] a
Levels: a
df[, c("abc", "xyz")]
tb <- tibble(abc = 1, xyz = "a")
tb$x
Unknown or uninitialised column: 'x'.
NULL
tb[, "xyz"]
tb[, c("abc", "xyz")]
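The differences in question 2 can be checked programmatically. The sketch below rebuilds the same two objects and tests the two behaviours that tend to cause trouble: partial name matching with $ and the return type of [ when a single column is selected.

```r
library(tibble)

df <- data.frame(abc = 1, xyz = "a")
tb <- tibble(abc = 1, xyz = "a")

# `$` on a data frame partially matches names, so df$x silently returns xyz
is.null(df$x)                    # FALSE
# A tibble never partially matches: tb$x is NULL (with a warning)
is.null(suppressWarnings(tb$x))  # TRUE

# `[` with a single column drops to a vector for a data frame...
is.data.frame(df[, "xyz"])       # FALSE
# ...but always returns a tibble for a tibble
is_tibble(tb[, "xyz"])           # TRUE
```

The data frame defaults can be frustrating because a mistyped column name may quietly return a different column, and the type returned by [ depends on how many columns are selected.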
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
#11.2
# read_csv() reads comma-delimited files. CSV files can be exported by programs like Excel and imported into R. By default read_csv() uses the first line of the data for the column names.
read_csv("a,b,c
1,2,3
4,5,6")
# You can use skip = n to skip the first n lines of the file.
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
# comment = "#" can be used to drop all lines beginning with #
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
# col_names = FALSE tells read_csv() not to treat the first row as headings; the columns are labelled sequentially from X1 to Xn
read_csv("1,2,3\n4,5,6", col_names = FALSE)
# Column names can also be supplied as a character vector through the col_names argument
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
# The na argument specifies the value (or values) used to represent missing data in the file, so that R reads them as NA
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”?
#I would use read_delim() with delim = "|", since read_csv() expects commas and read_tsv() expects tabs.
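A minimal sketch of reading pipe-separated fields follows. The inline data and the name pipe_data are made up for illustration; a real call would pass a file path instead of the string.

```r
library(readr)

# Inline data with "|" as the field separator (made up for illustration)
pipe_data <- read_delim("a|b|c\n1|2|3\n4|5|6", delim = "|")
pipe_data
```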
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.tsv", delim = "\t")
3: What are the two most important arguments to read_fwf()? Why?
# The two most important arguments are file, because otherwise no data could be read and displayed, and col_positions, which specifies where each field starts and ends in the file.
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
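The last inline file in question 5 uses semicolons, so read_csv() finds no commas and reads each row as a single field. The sketch below shows one possible fix, assuming the semicolons are meant as delimiters; the names one_col and two_cols are made up for this example.

```r
library(readr)

# read_csv() finds no commas, so it returns one column named "a;b"
one_col <- read_csv("a;b\n1;3")
names(one_col)

# read_csv2() expects ";" as the delimiter and returns columns a and b
two_cols <- read_csv2("a;b\n1;3")
names(two_cols)
```

read_delim() with delim = ";" would be another option.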
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
#personal notes
# write_csv() and write_tsv() export data to files
# Strings are always encoded in UTF-8
# write_excel_csv() exports a CSV file that Excel can read
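The round trip described above can be sketched as follows. The tibble scores and its values are made up for illustration, and a temporary file stands in for a real path on the hard drive.

```r
library(readr)
library(tibble)

# Small example tibble (made up for illustration)
scores <- tibble(name = c("Ann", "Bo"), score = c(90.5, 82))

# write_csv() takes the data object first, then the path to write to
path <- tempfile(fileext = ".csv")
write_csv(scores, path)

# Reading the file back recovers the same columns
scores_back <- read_csv(path)
scores_back
```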
Read R4ds Chapter 18: Pipes, sections 1-3.
#Personal notes
# %>% is the pipe operator from the magrittr package (loaded with the tidyverse)
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
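As a small illustration of what the pipe does, the sketch below writes the same two steps once as a nested call and once with %>%. The object names nested and piped are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Nested call: read from the inside out
nested <- arrange(filter(flights, month == 1, day == 1), dep_delay)

# Same steps with the pipe: read from top to bottom
piped <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(dep_delay)

# Both forms should give the same result
all.equal(nested, piped)
```

The pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is another way of writing f(x, y).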
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above it. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
# Personal notes for tidy data - each variable must have its own column, each observation must have its own row, each value must have its own cell
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
# Data not tidy
table4a
# The years 1999 and 2000 appear as column names rather than as values of a "year" variable. gather() collects them into a "year" column, with the counts in a "cases" column
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
# Table4b data is gathered as well
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
# Data combined into a single tibble from tables 4a and 4b
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
2: Why does this code fail? Fix it so it works.
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
Error in inds_combine(.vars, ind_list) : Position must be between 0 and n
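The call above fails because the bare numbers 1999 and 2000 are read as column positions, and table4a has only three columns. A sketch of the fix, following the earlier examples in this section, is to wrap the non-syntactic column names in backticks. The name fixed4a is made up for this example.

```r
library(tidyr)

# Backticks make gather() treat 1999 and 2000 as column names, not positions
fixed4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
fixed4a
```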
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
#data set
flights
# 5.1.2 (type abbreviations shown under the column names)
# int stands for integers.
# dbl stands for doubles, or real numbers.
# chr stands for character vectors, or strings.
# dttm stands for date-times (a date + a time).
# 5.1.3 - data manipulation
# Pick observations by their values (filter()).
# Reorder the rows (arrange()).
# Pick variables by their names (select()).
# Create new variables with functions of existing variables (mutate()).
# Collapse many values down to a single summary (summarise()).
# These can all be used in conjunction with group_by()
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
#Personal notes
# 5.2 filter()
# Look at a subset of observations based on their values; in our case, month 1 and day 1
filter(flights, month == 1, day == 1)
# To save the result, use the assignment operator <-. Since I used month 1 and day 1, the variable is called jan1
jan1 <- filter(flights, month == 1, day == 1)
# Wrapping the assignment in parentheses both prints the result and saves it to a variable
(dec25 <- filter(flights, month == 12, day == 25))
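The &, |, and ! operators can be combined inside filter(). The sketch below adapts the examples from the reading; the object names are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# | means "or": flights that departed in November or December
nov_dec <- filter(flights, month == 11 | month == 12)

# %in% is a shorthand for the same test
nov_dec_in <- filter(flights, month %in% c(11, 12))

# ! negates a condition; by De Morgan's law these two filters are equivalent
not_late_a <- filter(flights, !(arr_delay > 120 | dep_delay > 120))
not_late_b <- filter(flights, arr_delay <= 120, dep_delay <= 120)
```

Conditions separated by commas inside filter() are combined with "and", which is why not_late_b needs no explicit &.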
1.1: Find all flights with a delay of 2 hours or more.
# Flights that had a departure delay of 2 hours or more (dep_delay is measured in minutes)
filter(flights, dep_delay >= 120)
1.2: Flew to Houston (IAH or HOU)
filter(flights, dest %in% c("IAH", "HOU"))
1.3: Were operated by United (UA), American (AA), or Delta (DL).
# Flights operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))
1.4: Departed in summer (July, August, and September).
filter(flights, month %in% c(7, 8, 9))
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, arr_delay > 120, dep_delay <= 0)
1.6: Flights were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
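# Departed at least 60 minutes late, but the arrival delay was more than 30 minutes smaller than the departure delay
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)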
1.7: Departed between midnight and 6am (inclusive)
# Flights that departed between midnight and 6am; dep_time is in HHMM format and midnight is recorded as 2400
filter(flights, dep_time <= 600 | dep_time == 2400)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
# between(x, left, right) is a shortcut for x >= left & x <= right. It tests a single column, so it goes inside filter() and simplifies the summer-months challenge.
filter(flights, between(month, 7, 9))
3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
# 8,255 flights are missing departure times.
filter(flights, is.na(year))
filter(flights, is.na(month))
filter(flights, is.na(day))
filter(flights, is.na(sched_dep_time))
filter(flights, is.na(dep_delay))
filter(flights, is.na(arr_time))
filter(flights, is.na(sched_arr_time))
filter(flights, is.na(arr_delay))
filter(flights, is.na(carrier))
filter(flights, is.na(tailnum))
filter(flights, is.na(origin))
filter(flights, is.na(dest))
filter(flights, is.na(air_time))
filter(flights, is.na(distance))
filter(flights, is.na(hour))
filter(flights, is.na(minute))
filter(flights, is.na(time_hour))
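The column-by-column checks above can be condensed. The sketch below counts the missing values in every column at once with base R; the name na_counts is made up for this example. Rows with a missing dep_time plausibly represent cancelled flights, though the data set itself does not say so.

```r
library(nycflights13)

# Number of flights with a missing departure time
sum(is.na(flights$dep_time))

# Missing values per column, keeping only the columns that have any
na_counts <- colSums(is.na(flights))
na_counts[na_counts > 0]
```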
4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA^0
[1] 1
NA | TRUE
[1] TRUE
Note: For some context, see this thread
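The general rule behind question 4 is that an expression involving NA is not missing whenever the result would be the same for every possible value the NA could stand for. A short sketch of the four cases mentioned in the question:

```r
# Any number raised to the power 0 is 1, so the unknown value does not matter
NA^0

# "anything or TRUE" is TRUE regardless of the unknown value
NA | TRUE

# "FALSE and anything" is FALSE regardless of the unknown value
FALSE & NA

# NA * 0 stays NA: the unknown value could be Inf, and Inf * 0 is NaN, not 0
NA * 0
```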
arrange()
# Rows are reordered with the arrange() function
arrange(flights, year, month, day)
# The desc() function sorts the rows by a column in descending order
arrange(flights, desc(dep_delay))
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do?
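# is.na(dep_time) is TRUE for missing values, and desc() sorts TRUE before FALSE, so the NAs come first, followed by the remaining flights from the earliest date onward
arrange(flights, desc(is.na(dep_time)), year, month, day, dep_time)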
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
# The desc() function sorts the rows in descending order of the chosen variable
arrange(flights, desc(dep_delay))
arrange(flights, dep_delay)
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
arrange(flights, air_time)
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
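A possible approach to the optional challenge is sketched below. It computes air speed in miles per hour from distance (miles) and air_time (minutes); the column name speed_mph and the object name fastest are made up for this example.

```r
library(dplyr)
library(nycflights13)

# distance is in miles and air_time in minutes, so multiply by 60 for mph
fastest <- flights %>%
  mutate(speed_mph = distance / air_time * 60) %>%
  arrange(desc(speed_mph)) %>%
  select(carrier, flight, origin, dest, distance, air_time, speed_mph)
fastest
```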
4: Which flights travelled the longest? Which travelled the shortest?
# Shortest flights, sorted by time in the air
arrange(flights, air_time)
#Longest flights arranged
arrange(flights, desc(air_time))
select()
# Given example from the reading
select(flights, year, month, day)
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, dep_delay, arr_time, arr_delay, dep_time)
select(flights, arr_time, arr_delay, dep_time, dep_delay)
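The three calls above differ only in column order. A few genuinely different ways of selecting the same four columns are sketched below, using select helpers and column positions; the object names are made up for this example.

```r
library(dplyr)
library(nycflights13)

# By name prefix: dep_time, dep_delay, then arr_time, arr_delay
by_prefix <- select(flights, starts_with("dep_"), starts_with("arr_"))

# By regular expression on the full column name
by_regex <- select(flights, matches("^(dep|arr)_(time|delay)$"))

# By column position (assumes the current column order of flights)
by_position <- select(flights, 4, 6, 7, 9)
```

Selecting by position is the most fragile of the three, since it breaks if the columns are ever reordered.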
2: What happens if you include the name of a variable multiple times in a select() call?
# A variable named more than once is included only once, at its first position
select(flights, dep_time, dep_delay, arr_time, arr_delay, dep_time)
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
# one_of() matches variable names in a character vector.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
# one_of() only works inside a selecting function such as select(); it is helpful here because the column names are already stored in a character vector
select(flights, one_of(vars))
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
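# The result did not surprise me: the select helpers ignore case by default, so contains("TIME") matches every column whose name contains "time". The default can be changed with ignore.case = FALSE.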
select(flights, contains("TIME"))
time_flights <- select(flights, contains("TIME"))
time_flights
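To see the effect of changing the default, the sketch below runs the same selection with and without case-insensitive matching. The object names any_case and exact_case are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Default: matching ignores case, so "TIME" matches dep_time, arr_time, ...
any_case <- select(flights, contains("TIME"))
names(any_case)

# With ignore.case = FALSE no column name contains upper-case "TIME"
exact_case <- select(flights, contains("TIME", ignore.case = FALSE))
ncol(exact_case)
```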